Introduction to Web Scraping and Data Management for Social Scientists

Session 5: Scaling, Reporting and Database Software

Johannes B. Gruber

2024-07-26

Introduction

This Course

Day Session
1 Introduction
2 Data Structures and Wrangling
3 Working with Files
4 Linking and joining data & SQL
5 Scaling, Reporting and Database Software
6 Introduction to the Web
7 Static Web Pages
8 Application Programming Interface (APIs)
9 Interactive Web Pages
10 Building a Reproducible Research Project

The Plan for Today

In this session, you will learn:

  • Repetition: DBMS
  • Working with PostgreSQL
  • Working with text databases
  • Benchmarking
  • Final scaling tips


Databases

DBMS: servers and clients

  • most DBMS are set up in client-server architecture:
    • server: can be a computer somewhere or a process on your own computer that fulfills requests
    • DBMS server: contains the database and database management system
    • client: interacts with server (sends requests, receives responses)
    • DBMS client: can upload and retrieve data from server or send processing instructions
  • Why though:
    • server can run on more powerful hardware somewhere else
    • requests from multiple users don’t interfere with each other
    • access control per user to support different roles

Interacting with a database management system from Weidmann (2023), p. 105.

Got Server?

https://azure.microsoft.com/en-us/free/

RDBMS software

  • many popular alternatives:
    • Oracle Database
    • MySQL
    • Microsoft SQL Server
    • IBM Db2
    • Microsoft Access
    • SQLite (free software)
    • MariaDB (free software)
  • all use slightly different dialects of SQL, but the core functionality is the same
  • We will use PostgreSQL: free and open source, well-known, and many other programming languages and tools can interface with it
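To get a feeling for those dialect differences, compare how an auto-incrementing primary key is declared in two dialects (an illustrative sketch; the table and column names are made up, the surrounding DBI code would be identical either way):

```r
# SQLite: an INTEGER PRIMARY KEY column auto-increments implicitly
sqlite_ddl <- "CREATE TABLE logs (id INTEGER PRIMARY KEY, msg TEXT)"

# PostgreSQL: SERIAL creates a backing sequence for the same effect
postgres_ddl <- "CREATE TABLE logs (id SERIAL PRIMARY KEY, msg TEXT)"

# either string would be sent with dbExecute(db, ...) on the
# matching connection; only the SQL text changes between systems
```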

Working with PostgreSQL

Installation

We are using Docker to spin up a local server that has PostgreSQL already installed.

You have to type this into your Terminal(!):

docker-compose -f 05_Scaling_Reporting_and_Database_Software/data/docker-compose.yml up -d

This is running the compose file below:

services:
  postgres:
    image: postgres
    container_name: postgres_db
    environment:
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: pgpasswd
      POSTGRES_DB: dbintro
    ports:
      - "5432:5432"
    volumes:
      - pgdata:/var/lib/postgresql/data

volumes:
  pgdata:

Connecting from R

library(DBI)
library(RPostgres)
db <- dbConnect(
  Postgres(),
  dbname = "dbintro",
  host = "localhost",  # Make sure to specify the host
  port = 5432L,
  user = "postgres",
  password = "pgpasswd"
)

Let’s fill this database with the nycflights13 flights data:

library(nycflights13)
dbWriteTable(db, "airports", airports, overwrite = TRUE)
dbWriteTable(db, "flights", flights, overwrite = TRUE)
dbWriteTable(db, "weather", weather, overwrite = TRUE)
dbWriteTable(db, "planes", planes, overwrite = TRUE)
dbWriteTable(db, "airlines", airlines, overwrite = TRUE)

Connecting from the terminal

docker exec -it -u postgres postgres_db psql

Inside the Docker container (postgres_db is the name of the container running the server), you can list tables with:

\dt

You can run SQL operations in here without R:

SELECT * FROM airports;

Note that commands are only executed once psql encounters a ;!

  • Exit a long print with q

Working with PostgreSQL: pretty similar to SQLite…

But not the same…

dbExecute(db,
          "CREATE TABLE df1 (
              id SERIAL PRIMARY KEY,
              capital_letters VARCHAR(1) CHECK (capital_letters ~ '^[A-Z]$'), 
              my_date DATE CHECK (my_date > '2000-01-01')
          )")
[1] 0
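We can check that the constraints actually bite: an INSERT that satisfies both CHECK clauses goes through, while one that violates them is rejected by the server (a sketch against the db connection from above):

```r
# valid row: a single capital letter and a date after 2000-01-01
dbExecute(db, "INSERT INTO df1 (capital_letters, my_date)
               VALUES ('A', '2013-01-01')")

# invalid row: a lowercase letter violates the CHECK constraint on
# capital_letters, so PostgreSQL rejects the whole INSERT with an error
try(
  dbExecute(db, "INSERT INTO df1 (capital_letters, my_date)
                 VALUES ('a', '2013-01-01')")
)
```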

Working with PostgreSQL and dbplyr: the same as SQLite

I copied this from the last session, and it works just as well:

tbl(db, "flights") |> 
  inner_join(tbl(db, "planes"), by = "tailnum", suffix = c("", "_plane")) |> 
  mutate(plane_age = year - year_plane) |> 
  select(arr_delay, plane_age) |> 
  filter(!is.na(arr_delay),
         !is.na(plane_age)) |> 
  collect() |> 
  group_by(plane_age) |> 
  summarise(avg_delay = mean(arr_delay)) |> 
  ggplot(aes(x = plane_age, y = avg_delay)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)

User-based authentication

Imagine this:

  • you have a team of researchers
  • some are responsible for data gathering (annotation, webscraping, conducting survey waves, etc.)
  • some are responsible for data analysis
  • you are responsible for data management and public reporting
  • You want to make sure that:
    • researchers in team gathering do not overwrite each others changes
    • team analysis always has the newest data
    • the summary data on the website is online as soon as possible
  • with PostgreSQL you can make sure the two groups don’t disrupt each other and the most recent results are pulled from the database automatically

User-based authentication: users

We create three new users:

  • one for a researcher in the gatherer group
  • one for an analyser in the analysis group
  • one called “reader” which represents the general public
dbExecute(db, "CREATE USER gatherer WITH ENCRYPTED PASSWORD 'supersecret'")
[1] 0
dbExecute(db, "CREATE USER analyser WITH ENCRYPTED PASSWORD 'supersecret'")
[1] 0
dbExecute(db, "CREATE USER reader WITH ENCRYPTED PASSWORD 'supersecret'")
[1] 0

We can look at users:

dbGetQuery(db, "SELECT usename FROM pg_user")
   usename
1 postgres
2 gatherer
3 analyser
4   reader

User-based authentication: users

Let’s log in as analyser:

db_analyser <- dbConnect(
  Postgres(),
  dbname = "dbintro",
  host = "localhost",  # Make sure to specify the host
  port = 5432L,
  user = "analyser",
  password = "supersecret"
)

User-based authentication: roles

So far, none of the new users can do anything:

tbl(db_analyser, "flights")
Error in `db_query_fields.DBIConnection()`:
! Can't query fields.
ℹ Using SQL: SELECT * FROM "flights" AS "q01" WHERE (0 = 1)
Caused by error:
! Failed to fetch row : ERROR:  permission denied for table flights

They have to be granted privileges first.

User-based authentication: roles

We first give the “gatherer” user permission to update a table:

dbExecute(db, "GRANT UPDATE,INSERT ON flights TO gatherer")
[1] 0

Since we do not fully trust the analyser, we give them read-only access to flights, but full access to df1, where we pretend the results of the analysis are stored:

dbExecute(db, "GRANT SELECT ON flights TO analyser")
[1] 0
dbExecute(db, "GRANT ALL PRIVILEGES ON df1 TO analyser")
[1] 0

Finally, the public gets SELECT access to just one column of the results table:

dbExecute(db, "GRANT SELECT (capital_letters) ON df1 TO reader")
[1] 0

User-based authentication: roles (analyser)

Still logged in as the analyser, let’s try to access the data again:

tbl(db_analyser, "flights")
# Source:   table<"flights"> [?? x 19]
# Database: postgres  [analyser@localhost:5432/dbintro]
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# ℹ more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

But we still can’t make changes:

dbExecute(db_analyser,
          "INSERT INTO flights (year)
            VALUES (2013)")
Error: Failed to fetch row : ERROR:  permission denied for table flights

User-based authentication: roles (gatherer)

Logging in as the gatherer, we can add new cases:

db_gatherer <- dbConnect(
  Postgres(),
  dbname = "dbintro",
  host = "localhost",  # Make sure to specify the host
  port = 5432L,
  user = "gatherer",
  password = "supersecret"
)
dbExecute(db_gatherer,
          "INSERT INTO flights (year)
            VALUES (2013)")
[1] 1

User-based authentication: roles (reader)

Logging in as the reader, we can’t change anything and can only read specific columns:

db_reader <- dbConnect(
  Postgres(),
  dbname = "dbintro",
  host = "localhost",  # Make sure to specify the host
  port = 5432L,
  user = "reader",
  password = "supersecret"
)
dbGetQuery(db_reader, "SELECT * FROM df1")
Error: Failed to fetch row : ERROR:  permission denied for table df1
dbGetQuery(db_reader, "SELECT capital_letters FROM df1")
[1] capital_letters
<0 rows> (or 0-length row.names)

Now we could give these credentials out to the public without having to worry that anyone changes or reads anything they are not supposed to.
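Privileges can also be withdrawn again with REVOKE, for example if the reader account should be retired (a sketch; run on the superuser connection db):

```r
# take the column-level SELECT right away from reader ...
dbExecute(db, "REVOKE SELECT (capital_letters) ON df1 FROM reader")

# ... and remove the user entirely once no privileges remain
dbExecute(db, "DROP USER reader")
```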

Exercises 1

Using the PostgreSQL database or the SQLite database from session 4:

  1. Right join results_state and facts using dbplyr instead of dbGetQuery
  2. Recreate the table results_state_time by querying and joining from db (using dbplyr instead of dbGetQuery).
  3. Recreate results_state_facts using dbplyr. Don’t forget to add total_votes and pct_votes
  4. Extract the SQL query from your code in 1. and run it with dbGetQuery

Working with text data in AmCAT

Why AmCAT

  • Optimized to store, annotate, preprocess, search, share and present text data collections
  • For teams or individual researchers
  • Fine-grained access control
  • Free and Open Source

Installation

Same as with PostgreSQL, we can use Docker.

  1. Download the compose file:
curl::curl_download(
  url = "https://raw.githubusercontent.com/ccs-amsterdam/amcat4docker/main/docker-compose.yml", 
  destfile = "data/docker-compose-amcat.yml"
)
  2. Spin it up via the Terminal:
docker-compose -f 05_Scaling_Reporting_and_Database_Software/data/docker-compose-amcat.yml up -d

Connecting from the terminal

Creating a test index:

docker exec -it amcat4 amcat4 create-test-index

Configure the AmCAT server:

docker exec -it amcat4 amcat4 config

Connecting from R

We need to log in, which by default works without creating a user:

# remotes::install_github("ccs-amsterdam/amcat4r")
library(amcat4r)
amcat_login("http://localhost/amcat")

We can have a look at the example corpus with:

query_documents(index = "state_of_the_union")
# A tibble: 200 × 3
   .id      title                   date               
   <id_col> <chr>                   <dttm>             
 1 RHq…s95  1790: George Washington 1790-01-08 00:00:00
 2 RXq…s96  1790: George Washington 1790-12-08 00:00:00
 3 Rnq…s97  1791: George Washington 1791-10-25 00:00:00
 4 R3q…s97  1792: George Washington 1792-11-06 00:00:00
 5 SHq…s97  1793: George Washington 1793-12-03 00:00:00
 6 SXq…s97  1794: George Washington 1794-11-19 00:00:00
 7 Snq…s97  1795: George Washington 1795-12-08 00:00:00
 8 S3q…s97  1796: George Washington 1796-12-07 00:00:00
 9 THq…s97  1797: John Adams        1797-11-22 00:00:00
10 TXq…s97  1798: John Adams        1798-12-08 00:00:00
# ℹ 190 more rows

Dataset

Load data from Rauh and Schwalbach (2020) into AmCAT:

corp_hoc_df <- readRDS("../03_Working_with_Files/data/Corp_HouseOfCommons_V2.rds") |> 
  mutate(date = as.Date(date)) |> 
  rename(title = agenda) |> 
  filter(!is.na(date)) |> 
  replace_na(list(title = "", text = "")) |> 
  select(-party.facts.id)

# define types of fields
fields = list(
  date = "date",
  text = "text",
  title = "text",        
  speechnumber = "integer",   
  speaker = "keyword",       
  party = "keyword",
  chair = "boolean",      
  terms = "integer",          
  parliament = "keyword", 
  iso3country = "keyword"
)
# create the index
create_index(index = "houseofcommons", 
             name = "House Of Commons", 
             description = "HouseOfCommons", 
             create_fields = fields)

# upload the data
upload_documents("houseofcommons", documents = corp_hoc_df, chunk_size = 1000, verbose = TRUE)

AmCAT GUI

AmCAT API

  • R and Python packages
  • OpenAPI specifications
  • Search, upload, download, modify data (e.g., add keywords/categories)
  • User and access management
  • Easy way to make research reproducible by querying data for analysis
  • Great for larger than memory datasets

AmCAT API

query_documents(index = "houseofcommons", queries = "*Europ*")
# A tibble: 200 × 3
   .id      date                title                                           
   <id_col> <dttm>              <chr>                                           
 1 G4i…IIz  2004-07-20 00:00:00 Iraq [Foreign And Commonwealth Affairs > Defenc…
 2 lIi…IIz  2004-07-20 00:00:00 Iraq [Foreign And Commonwealth Affairs > Defenc…
 3 xIi…IIz  2004-07-20 00:00:00 Energy Bill                                     
 4 0Ii…IIz  2004-07-20 00:00:00 Mr Ross Donovan [Petitions]                     
 5 04i…IIz  2004-07-20 00:00:00 Mr Ross Donovan [Petitions]                     
 6 14i…IIz  2004-07-20 00:00:00 Mr Ross Donovan [Petitions]                     
 7 2Ii…IIz  2004-07-20 00:00:00 Denominational Schools                          
 8 84i…IIz  2004-07-20 00:00:00 Public Corporations                             
 9 AIi…IMz  2004-07-20 00:00:00 International Court Of Justice Ruling (Israel)  
10 AYi…IMz  2004-07-20 00:00:00 International Court Of Justice Ruling (Israel)  
# ℹ 190 more rows
query_documents(index = "houseofcommons", queries = '"European Union" OR EU')
# A tibble: 200 × 3
   .id      date                title                                           
   <id_col> <dttm>              <chr>                                           
 1 94i…IIz  2004-07-20 00:00:00 Public Corporations                             
 2 AIi…IMz  2004-07-20 00:00:00 International Court Of Justice Ruling (Israel)  
 3 AYi…IMz  2004-07-20 00:00:00 International Court Of Justice Ruling (Israel)  
 4 Coi…IMz  2004-07-20 00:00:00 International Court Of Justice Ruling (Israel)  
 5 DIi…IMz  2004-07-20 00:00:00 International Court Of Justice Ruling (Israel)  
 6 94i…IM0  2004-07-21 00:00:00 Fishery Limits (United Kingdom) [Fishery Limits…
 7 fYi…YQZ  2004-07-21 00:00:00 Devolution [Fishery Limits (United Kingdom)]    
 8 xoi…YQZ  2004-07-21 00:00:00 Fire And Rescue National Framework [Clause 21]  
 9 HIi…YUa  2004-07-21 00:00:00 Sudan (Peace Agreement)                         
10 IYi…YUa  2004-07-21 00:00:00 Sudan (Peace Agreement)                         
# ℹ 190 more rows
query_aggregate(index = "houseofcommons", 
                axes = list(list(field = "party")),
                queries = '"European Union" OR EU')
# A tibble: 18 × 2
   party                         n
   <chr>                     <int>
 1 other                        45
 2 UUP                         227
 3 UKIP                         49
 4 The Independents              4
 5 SNP                        7201
 6 SDP                           1
 7 SDLP                        266
 8 Respect                       3
 9 PlaidCymru                  889
10 LibDem                     6217
11 Lab                       42873
12 Independent                 485
13 GPEW                        272
14 DUP                        1316
15 Con                       65304
16 Change UK                   204
17 Birkenhead Social Justice     4
18 APNI                         16

Access Control in AmCAT enables Non-Consumptive Research

  • Sharing data helps scientific progress!
  • Copyright and/or (privacy-)sensitive material cannot easily be shared
  • Definition: non-consumptive research involves the use of computational methods to analyze data without giving access to the data itself

Scaling your research

Benchmarking

  • when your data grows, it becomes important that your code is fast, since it is applied to many cases
  • benchmarking helps you identify bottlenecks
  • whether or not your code is optimised for speed can make the difference between a job taking minutes or days

Benchmarking with bench

Basic syntax:

  1. define functions that wrap your code
library(bench)
fun1 <- function() {
  1 + 1
}
fun2 <- function() {
  a <- 1
  b <- 1
  sum(c(a, b))
}
  2. run mark
res <- mark(simple = fun1(), complex = fun2())
  3. check summary
summary(res)
# A tibble: 2 × 6
  expression      min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 simple            0     71ns  2711522.        0B        0
2 complex           0    490ns  1744177.        0B        0
summary(res, relative = TRUE)
# A tibble: 2 × 6
  expression   min median `itr/sec` mem_alloc `gc/sec`
  <bch:expr> <dbl>  <dbl>     <dbl>     <dbl>    <dbl>
1 simple       NaN    1        1.55       NaN      NaN
2 complex      NaN    6.9      1          NaN      NaN

Benchmarking: in memory vs in database

  1. define functions that wrap your code
in_memory <- function() {
  nycflights13::flights |> 
    inner_join(nycflights13::weather, by = c("time_hour", "origin")) |> 
    filter(time_hour < "2013-03-01")
}

in_db <- function() {
  tbl(db, "flights") |> 
    inner_join(tbl(db, "weather"), by = c("time_hour", "origin")) |> 
    filter(time_hour < "2013-03-01") |> 
    collect()
}
  2. run mark
res <- mark(in_memory = in_memory(), in_db = in_db(), check = FALSE, iterations = 15)
  3. check summary
summary(res)
# A tibble: 2 × 6
  expression      min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 in_memory         0     71ns  7614313.        0B        0
2 in_db             0     70ns  6709282.        0B        0
summary(res, relative = TRUE)
# A tibble: 2 × 6
  expression   min median `itr/sec` mem_alloc `gc/sec`
  <bch:expr> <dbl>  <dbl>     <dbl>     <dbl>    <dbl>
1 in_memory    NaN   1.01      1.13       NaN      NaN
2 in_db        NaN   1         1          NaN      NaN

Fix bottlenecks

Without an index, the search in the database takes longer. You can also create indexes automatically when copying data with copy_to() (see the dbplyr documentation):

dbExecute(db, "CREATE INDEX ON flights (time_hour)")
[1] 0
dbExecute(db, "CREATE INDEX ON weather (time_hour)")
[1] 0
  1. define functions that wrap your code
in_memory <- function() {
  nycflights13::flights |> 
    inner_join(nycflights13::weather, by = c("time_hour", "origin")) |> 
    filter(time_hour < "2013-03-01")
}

in_db <- function() {
  tbl(db, "flights") |> 
    inner_join(tbl(db, "weather"), by = c("time_hour", "origin")) |> 
    filter(time_hour < "2013-03-01") |> 
    collect()
}
  2. run mark
res <- mark(in_memory = in_memory(), in_db = in_db(), check = FALSE, iterations = 15)
  3. check summary
summary(res)
# A tibble: 2 × 6
  expression      min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 in_memory         0    419ns  2386641.        0B        0
2 in_db             0    419ns  2441108.        0B        0
summary(res, relative = TRUE)
# A tibble: 2 × 6
  expression   min median `itr/sec` mem_alloc `gc/sec`
  <bch:expr> <dbl>  <dbl>     <dbl>     <dbl>    <dbl>
1 in_memory    NaN   1.00      1          NaN      NaN
2 in_db        NaN   1         1.02       NaN      NaN

Data Management: Summary

What for?

  • enables new research questions, as you can combine and clean new data
  • lets you sleep easy at night
  • makes your work transparent to others and yourself
  • thinking about reproducibility from the start

When to use files vs. databases

  • often not clear from the start
  • working with files is easier as a solo researcher (who keeps a clean project directory)
  • when data changes regularly it makes sense to not add to the same file again and again
  • when several researchers collaborate, a database solves many issues
  • when you have text data, vectors, or complex elements, NoSQL databases like AmCAT or MongoDB can make sense

When done, dbDisconnect

Whenever you are done working with a database, you should disconnect from it:

dbDisconnect(db)
dbDisconnect(db_analyser)
dbDisconnect(db_gatherer)
dbDisconnect(db_reader)
  • This closes the connection, discards all pending work, and frees resources
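If you connect inside a function, you can make sure the connection is closed even when an error occurs, e.g. with on.exit() (a sketch reusing the connection details from above; the function name is made up):

```r
query_flights <- function() {
  con <- dbConnect(
    Postgres(),
    dbname = "dbintro", host = "localhost", port = 5432L,
    user = "postgres", password = "pgpasswd"
  )
  # disconnect when the function exits, even if the query errors
  on.exit(dbDisconnect(con))
  dbGetQuery(con, "SELECT COUNT(*) FROM flights")
}
```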

Wrap Up

Save some information about the session for reproducibility.

sessionInfo()
R version 4.4.1 (2024-06-14)
Platform: x86_64-pc-linux-gnu
Running under: EndeavourOS

Matrix products: default
BLAS:   /usr/lib/libblas.so.3.12.0 
LAPACK: /usr/lib/liblapack.so.3.12.0

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

time zone: Europe/London
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] bench_1.1.3         amcat4r_4.0.14.9000 nycflights13_1.0.2 
 [4] RPostgres_1.4.7     DBI_1.2.2           lubridate_1.9.3    
 [7] forcats_1.0.0       stringr_1.5.1       dplyr_1.1.4        
[10] purrr_1.0.2         readr_2.1.5         tidyr_1.3.1        
[13] tibble_3.2.1        ggplot2_3.5.1       tidyverse_2.0.0    
[16] tinytable_0.3.0.10 

loaded via a namespace (and not attached):
 [1] rappdirs_0.3.3    utf8_1.2.4        generics_0.1.3    lattice_0.22-6   
 [5] stringi_1.8.4     hms_1.1.3         digest_0.6.35     magrittr_2.0.3   
 [9] evaluate_0.23     grid_4.4.1        timechange_0.3.0  fastmap_1.1.1    
[13] blob_1.2.4        Matrix_1.7-0      jsonlite_1.8.8    mgcv_1.9-1       
[17] fansi_1.0.6       scales_1.3.0      httr2_1.0.1       cli_3.6.3        
[21] rlang_1.1.4       dbplyr_2.5.0      bit64_4.0.5       munsell_0.5.1    
[25] splines_4.4.1     withr_3.0.0       yaml_2.3.8        tools_4.4.1      
[29] tzdb_0.4.0        colorspace_2.1-0  profmem_0.6.0     curl_5.2.1       
[33] vctrs_0.6.5       R6_2.5.1          lifecycle_1.0.4   bit_4.0.5        
[37] pkgconfig_2.0.3   pillar_1.9.0      gtable_0.3.5      glue_1.7.0       
[41] xfun_0.44         tidyselect_1.2.1  rstudioapi_0.16.0 knitr_1.46       
[45] farver_2.1.2      htmltools_0.5.8.1 nlme_3.1-164      labeling_0.4.3   
[49] rmarkdown_2.26    compiler_4.4.1    askpass_1.2.0     openssl_2.2.0    

References

Rauh, Christian, and Jan Schwalbach. 2020. “The ParlSpeech V2 Data Set: Full-Text Corpora of 6.3 Million Parliamentary Speeches in the Key Legislative Chambers of Nine Representative Democracies.” Harvard Dataverse. https://doi.org/10.7910/DVN/L4OAKN.
Weidmann, Nils B. 2023. Data Management for Social Scientists: From Files to Databases. 1st ed. Cambridge University Press. https://doi.org/10.1017/9781108990424.